By: Arun Dhingra and Arjun Kothakota
!pip3 install plotly
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

import plotly
import plotly.graph_objs as go
import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

warnings.simplefilter('ignore')
init_notebook_mode(connected=False)
The US elections are a perennial source of controversy. More than once, a presidential candidate has won without securing the most votes nationwide. For example, in 2000, Al Gore won the popular vote but lost the presidential election to George W. Bush. The Electoral College debate flared up again in 2016 when Hillary Clinton lost despite leading her opponent, Donald Trump, by nearly 3,000,000 popular votes. https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_in_which_the_winner_lost_the_popular_vote
In the Electoral College, each state is assigned a number of votes based on its population, updated every decade with the results of the decennial census. For example, California, the most populous state, has 55 electoral votes, while Wyoming has 3. Each state awards its electoral votes according to its popular vote: the candidate who wins the popular vote in California receives all 55 of its electoral votes. There are two exceptions: Maine and Nebraska award most of their electoral votes by the popular vote within each congressional district, with the remaining two going to the statewide winner.
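The winner-take-all rule can be sketched in a few lines of Python (the vote totals below are illustrative, approximately the 2016 figures for these two states, and Maine/Nebraska's district split is ignored):

```python
# Illustrative statewide totals, approximately the 2016 results.
state_results = {
    # state: (electoral_votes, democrat_votes, republican_votes)
    'California': (55, 8_753_788, 4_483_810),
    'Wyoming': (3, 55_973, 174_419),
}

def allocate_winner_take_all(results):
    """Award every one of a state's electoral votes to its popular-vote winner."""
    totals = {'democrat': 0, 'republican': 0}
    for state, (elec_votes, dem, rep) in results.items():
        winner = 'democrat' if dem > rep else 'republican'
        totals[winner] += elec_votes
    return totals

print(allocate_winner_take_all(state_results))
# {'democrat': 55, 'republican': 3}: however close a state's race,
# the losing side's ballots contribute nothing to the national tally.
```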
One of the main criticisms of the Electoral College is that the 'winner takes all' system in each state effectively discards the votes of everyone who backed the losing candidate. The origins of the Electoral College have also been linked to voter suppression; in the words of James Madison, “The right of suffrage was much more diffusive in the Northern than the Southern States; and the latter could have no influence in the election on the score of the Negroes.” Additionally, Electoral College results are a resounding source of misinformation: many people see a map of the results and assume a blatant majority for the winner, but these state-by-state maps do not accurately reflect the closeness of an election.
In this notebook, we delve into misconceptions about election results and take a critical position against the Electoral College. We will first look at how decisively the winning candidate in each election year actually wins each state, and how this affects the Electoral College results. We will then attempt to predict the outcome of the 2020 presidential election using a model trained on previous presidential election data.
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Spotted: A map to be hung somewhere in the West Wing <a href="https://t.co/TpPPDyNFtE">pic.twitter.com/TpPPDyNFtE</a></p>— Trey Yingst (@TreyYingst) <a href="https://twitter.com/TreyYingst/status/862669407868391424?ref_src=twsrc%5Etfw">May 11, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
winners = {
'democrat': [1976, 1992, 1996, 2008, 2012],
'republican': [1980, 1984, 1988, 2000, 2004, 2016]
}
parties = ['democrat', 'republican']
pop_cand = pd.read_csv("1976-2016-president.csv")
pop_cand = pop_cand[pop_cand['writein'] == False]
pop_cand = pop_cand.drop(columns=['office', 'version', 'notes', 'state_cen', 'state_ic', 'writein'])
pop_cand = pop_cand.dropna(axis=0)
pop_cand.head(20)
pop_cand = pop_cand[pop_cand['party'].isin(parties)]
We used several datasets in the upcoming analysis: election results by state, electoral votes by state over time, poverty rates by state over time, and the racial composition of states over time. Since there is surprisingly little freely available data for most election years (mainly those in the 1900s), we were restricted to election and socioeconomic data spanning the 1976 through 2016 presidential elections.
Our first dataset contains presidential election results for each state: each candidate's total number of votes along with their name and party. An additional column tracks the total number of votes cast in each state in each year. Since we are measuring how well candidates represent states, this dataset is essential to our analysis.
# useful dicts
fips = {"01": "Alabama",
"02": "Alaska",
"04": "Arizona",
"05": "Arkansas",
"06": "California",
"08": "Colorado",
"09": "Connecticut",
"10": "Delaware",
"11": "District of Columbia",
"12": "Florida",
"13": "Georgia",
"15": "Hawaii",
"16": "Idaho",
"17": "Illinois",
"18": "Indiana",
"19": "Iowa",
"20": "Kansas",
"21": "Kentucky",
"22": "Louisiana",
"23": "Maine",
"24": "Maryland",
"25": "Massachusetts",
"26": "Michigan",
"27": "Minnesota",
"28": "Mississippi",
"29": "Missouri",
"30": "Montana",
"31": "Nebraska",
"32": "Nevada",
"33": "New Hampshire",
"34": "New Jersey",
"35": "New Mexico",
"36": "New York",
"37": "North Carolina",
"38": "North Dakota",
"39": "Ohio",
"40": "Oklahoma",
"41": "Oregon",
"42": "Pennsylvania",
"44": "Rhode Island",
"45": "South Carolina",
"46": "South Dakota",
"47": "Tennessee",
"48": "Texas",
"49": "Utah",
"50": "Vermont",
"51": "Virginia",
"53": "Washington",
"54": "West Virginia",
"55": "Wisconsin",
"56": "Wyoming"
}
us_state_abbrev = {
'Alabama': 'AL',
'Alaska': 'AK',
'American Samoa': 'AS',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Guam': 'GU',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}
state_region = {
'AK': 'Other',
'AL': 'South',
'AR': 'South',
'AZ': 'West',
'CA': 'West',
'CO': 'West',
'CT': 'NorthEast',
'DC': 'NorthEast',
'DE': 'NorthEast',
'FL': 'South',
'GA': 'South',
'HI': 'Other',
'IA': 'MidWest',
'ID': 'West',
'IL': 'MidWest',
'IN': 'MidWest',
'KS': 'MidWest',
'KY': 'South',
'LA': 'South',
'MA': 'NorthEast',
'MD': 'NorthEast',
'ME': 'NorthEast',
'MI': 'MidWest',
'MN': 'MidWest',
'MO': 'MidWest',
'MS': 'South',
'MT': 'West',
'NC': 'South',
'ND': 'MidWest',
'NE': 'MidWest',
'NH': 'NorthEast',
'NJ': 'NorthEast',
'NM': 'West',
'NV': 'West',
'NY': 'NorthEast',
'OH': 'MidWest',
'OK': 'South',
'OR': 'West',
'PA': 'NorthEast',
'RI': 'NorthEast',
'SC': 'South',
'SD': 'MidWest',
'TN': 'South',
'TX': 'South',
'UT': 'West',
'VA': 'South',
'VT': 'NorthEast',
'WA': 'West',
'WI': 'MidWest',
'WV': 'South',
'WY': 'West'
}
def fips_to_state(fips_id):
fips_id = fips_id[:2]
return fips[fips_id] if fips_id in fips else 'NaN'
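County FIPS codes prefix the two-digit state code to a three-digit county code, so only the first two characters identify the state. A self-contained check of that idea, using a hypothetical two-entry copy of the `fips` table:

```python
# Miniature copy of the notebook's `fips` lookup, for a self-contained check.
fips_demo = {"01": "Alabama", "06": "California"}

def fips_to_state_demo(fips_id):
    # Only the two-digit state prefix matters for a state lookup.
    state_code = fips_id[:2]
    return fips_demo.get(state_code, 'NaN')

assert fips_to_state_demo('01001') == 'Alabama'     # Autauga County, AL
assert fips_to_state_demo('06037') == 'California'  # Los Angeles County, CA
assert fips_to_state_demo('99999') == 'NaN'         # unmapped prefix
```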
is_party_winner = {
'democrat': [1976, 1992, 1996, 2008, 2012],
'republican': [1980, 1984, 1988, 2000, 2004, 2016]
}
TOTAL_ELEC_VOTES = 538
ELECTION_YEARS = list(np.arange(1976, 2017, 4))
cand_votes = pd.read_csv("1976-2016-president.csv")
cand_votes = cand_votes[cand_votes['writein'] == False]
cand_votes = cand_votes.drop(columns=['office', 'version', 'notes', 'state_cen', 'state_ic', 'writein'])
cand_votes = cand_votes.dropna(axis=0)
cand_votes.columns = ['year', 'state', 'state_code', 'state_fips', 'cand', 'party', 'cand_votes', 'tot_votes']
cand_votes['party'] = cand_votes['party'].apply(lambda x: 'democrat' if x == 'democratic-farmer-labor' else x)
cand_votes['region'] = cand_votes['state_code'].apply(lambda x: state_region[x])
cand_votes.head(20)
Since the distribution of electoral votes shifts with state populations, we needed the electoral vote allocation for each decade covered by our election data. The data came from a graphic that was painstakingly transcribed into an Excel file.
# This has been based off of popular vote, however, how about the electoral college?
electoral = pd.read_excel("electoral-dist-1900-2016.xlsx")
state = []
years = []
votes = []
updated_elec_votes = pd.DataFrame()
for index, row in electoral.iterrows():
for year in ELECTION_YEARS:
years.append(year)
state.append(row['year'])
votes.append(row[year])
updated_elec_votes['state'] = state
updated_elec_votes['year'] = years
updated_elec_votes['votes'] = votes
updated_elec_votes.columns = ['state', 'year', 'elec_votes']
updated_elec_votes
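The row-by-row loop above works, but the same wide-to-long reshape could be done in one call with `pandas.melt`. A sketch on a toy frame shaped like the transcribed sheet (whose first column, somewhat confusingly, is labelled 'year' but holds state names):

```python
import pandas as pd

# Toy frame mimicking the transcribed sheet: the 'year' column holds state
# names, and each election year is its own column of electoral votes.
toy = pd.DataFrame({'year': ['Alabama', 'Alaska'],
                    1976: [9, 3],
                    1980: [9, 3]})

# melt turns each (state, election-year) cell into its own row.
long_form = toy.melt(id_vars='year', var_name='elec_year', value_name='elec_votes')
long_form.columns = ['state', 'year', 'elec_votes']
print(long_form)
```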
The race-by-state data unfortunately took a long time to collect and process. An intercensal database houses race distributions, but it only reaches back to the 1990s. The 1970s and 1980s data also used a somewhat different structure than the other datasets: they were keyed on FIPS state and county codes, which form an ID for a specific county by prefixing the state code to the county code. These datasets therefore needed heavy cleaning to conform to the more intuitive structure of the intercensal database.
Additionally, we had to compromise on some race categories. Prior to the 1990s, races were split into White, Black, and Other; in the later datasets, races were also split into 'Native American' and 'Pacific Islander'. To keep the datasets consistent, the 'Native American' and 'Pacific Islander' counts were aggregated into the Other category.
The intercensal data recorded race in a single column, so each year/state pair spanned six rows (Male/Female for each of White, Black, and Other). The dataframe was pivoted so that those six combinations become columns, making it more usable for plotting and modelling.
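The pivot described above can be illustrated on a toy frame: one row per (year, state, race, gender) combination becomes one column per race/gender pairing (the list-valued `index` argument to `pivot` needs a reasonably recent pandas, as the notebook itself assumes):

```python
import pandas as pd

# One row per (year, state, race, gender); 'total' is the population count.
toy = pd.DataFrame({
    'year':   [1992] * 4,
    'state':  ['Alabama'] * 4,
    'race':   ['White', 'White', 'Black', 'Black'],
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'total':  [10, 11, 12, 13],
})
toy['ragender'] = toy['race'] + toy['gender']

# Pivot so each race/gender combination becomes its own column.
wide = (toy.pivot(index=['year', 'state'], columns='ragender', values='total')
           .reset_index())
print(wide)
```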
https://www.census.gov/data/tables/time-series/demo/popest/1970s-state.html
race1970s = pd.read_excel('race-1970s.xlsx')
race1970s.columns = race1970s.iloc[0]
race1970s = race1970s[1:]
race1970s = race1970s.drop('FIPS State Code', axis = 1)
race1970s = race1970s[(race1970s['Year of Estimate'] == '1976')]
race1970s['gender'] = race1970s['Race/Sex Indicator'].apply(lambda x: 'Female' if 'female' in x else 'Male')
race1970s['Race/Sex Indicator'] = race1970s['Race/Sex Indicator'].apply(lambda x: x.split()[0])
race1970s['Year of Estimate'] = race1970s['Year of Estimate'].apply(lambda x: int(x))
cols = list(race1970s.columns[3:-1])
race1970s['total'] = race1970s[cols].sum(axis=1)
race1970s = race1970s.drop(columns=cols)
race1970s.columns = ['year', 'state', 'race', 'gender', 'total']
race1970s
race1980s = pd.read_excel("race-1980s.xls")
race1980s.columns = race1980s.iloc[0]
race1980s = race1980s[1:]
race1980s['gender'] = race1980s['Race/Sex Indicator'].apply(lambda x: 'Female' if 'female' in x else 'Male')
race1980s['Race/Sex Indicator'] = race1980s['Race/Sex Indicator'].apply(lambda x: x.split()[0])
cols = list(race1980s.columns[3:-1])
race1980s['total'] = race1980s[cols].sum(axis=1)
race1980s = race1980s.drop(columns=cols)
race1980s['FIPS State and County Codes'] = race1980s.apply(lambda x: fips_to_state(x['FIPS State and County Codes']), axis =1)
race1980s.columns = ['year', 'state', 'race', 'gender', 'total']
race1980s = race1980s[race1980s['state'] != 'NaN']
race1980s = race1980s.groupby(['year', 'state', 'race', 'gender'], as_index=False).agg({'total': 'sum'})
race1980s
race1992 = pd.read_excel('race-1990s-present.xls')
race1992 = race1992[race1992['Notes'] != 'Total']
race1992 = race1992.drop(['Notes', 'Yearly July 1st Estimates Code', 'State Code', 'Race Code', 'Gender Code'], axis=1)
race1992 = race1992.dropna()
race1992
race1992['Race'] = race1992['Race'].apply(lambda x: x if 'Indian' not in x and 'Asian' not in x else 'Other')
race1992['Race'] = race1992['Race'].apply(lambda x: x if 'American' not in x else 'Black')
race1992.columns = ['state', 'gender', 'race', 'year', 'total']
race1992 = race1992.groupby(['year', 'state', 'race', 'gender'], as_index = False).agg({'total': 'sum'})
race1992
A useful dataframe for our analysis records the winning candidate in each state for each election year (see the map below). The original dataset was reduced to the row with the maximum vote count per state and year, so that only the state winner remains.
# Keep the single highest-vote row per (year, state) so that the party and
# candidate columns stay aligned with the winning vote count
wins_by_state = (cand_votes.sort_values('cand_votes', ascending=False)
                           .drop_duplicates(['year', 'state'])
                           .sort_values(['year', 'state'])
                           .reset_index(drop=True))
wins_by_state = wins_by_state[['year', 'state', 'party', 'cand', 'cand_votes', 'tot_votes', 'state_code']]
wins_by_state['party_class'] = wins_by_state['party'].apply(lambda x: 0 if x == 'democrat' else 1)
wins_by_state
socioeconomic = pd.concat([race1970s, race1980s, race1992], ignore_index = True)
socioeconomic
poverty = pd.read_excel("poverty-by-state.xlsx")
poverty.columns = ['state', 'year', 'total', 'poor', 'percent']
poverty = poverty[1:]
poverty['percent'] = poverty['percent'] / 100
poverty
state_voted_for = pop_cand.copy(deep=True)
# Keep the winning row per (year, state) so the recorded party matches the
# candidate who actually received the most votes
state_voted_for = (state_voted_for.sort_values('candidatevotes', ascending=False)
                                  .drop_duplicates(['year', 'state'])
                                  [['year', 'state', 'party']]
                                  .reset_index(drop=True))
socioeconomic['ragender'] = socioeconomic['race'] + socioeconomic['gender']
socioeconomic = socioeconomic.drop(['race', 'gender'], axis = 1)
socioeconomic = socioeconomic.pivot(columns = ['ragender'], values = ['total'], index = ['year', 'state'])
socioeconomic = socioeconomic.reset_index(level = [0, 1])
socioeconomic.columns = socioeconomic.columns.map(''.join)
socioeconomic.columns = ['year', 'state', 'BlackFemale', 'BlackMale', 'OtherFemale', 'OtherMale', 'WhiteFemale', 'WhiteMale']
socioeconomic
state_voted_for
socioeconomic = socioeconomic.merge(state_voted_for, on = ['year', 'state'])
poverty.columns = ['state', 'year', 'total_families', 'poor_families', 'percent']
socioeconomic = socioeconomic.merge(poverty, on=['state', 'year'])
socioeconomic
Many parties exist in the United States, and it is important to know which ones receive most of the votes. The bar graph below shows this at a larger scale: the Democratic and Republican parties account for the vast majority of candidate votes, so they are the parties analyzed throughout this project.
parties = cand_votes.groupby(['party']).agg({'cand_votes': 'sum'})
parties.nlargest(20, 'cand_votes').plot(kind='bar', figsize=(20, 10))
valid_parties = ['democrat', 'republican']
cand_votes = cand_votes[cand_votes['party'].isin(valid_parties)]
wins_by_state = wins_by_state[wins_by_state['party'].isin(valid_parties)]
scl = [[0, '#2980b9'],[1, '#e74c3c']]
def plot_votes_map(arg):
data_slider = []
for year in ELECTION_YEARS:
df = wins_by_state[(wins_by_state['state'] != 'District of Columbia') & (wins_by_state['year'] == year)]
by_year = dict(
type='choropleth',
locations=df['state_code'],
z=df[arg].astype(float),
locationmode='USA-states',
colorscale=scl,
)
data_slider.append(by_year)
steps = []
count = 1976
for i in range(len(data_slider)):
step = dict(method='restyle',
args=['visible', [False] * len(data_slider)],
label='Year {}'.format(count)) # label to be displayed for each step (year)
step['args'][1][i] = True
steps.append(step)
count += 4
sliders = [dict(active=10, pad={"t": 1}, steps=steps)]
layout = dict(
geo=dict(scope='usa', projection={'type': 'albers usa'}),
sliders=sliders
)
fig = dict(data=data_slider, layout=layout)
plotly.offline.iplot(fig, show_link=False)
To analyze the roles of states in the Electoral College, it's best to see the election results on our interactive map (use the slider to move across years). One thing to point out is the 1984 election: it looks as if Ronald Reagan won in a sweeping landslide; however, we will soon see that the state-level map overstates how lopsided the vote actually was.
plot_votes_map('party_class')
To analyze the winning candidate's influence on each state, the original dataset was filtered to contain only the winning candidate's rows for each state (i.e., only Trump's rows for the 2016 election, and so on). Next, we computed an important statistic, the win proportion: the winning candidate's votes divided by the total number of votes cast.
def win_or_lose(row):
return 'W' if row['year'] in is_party_winner[row['party']] else 'L'
# Show distribution of votes by state
misleading = cand_votes.copy(deep=True)
misleading['outcome'] = misleading.apply(win_or_lose, axis=1)
misleading = misleading[misleading['outcome'] == 'W'].drop('outcome', axis=1)
misleading['win_prop'] = misleading['cand_votes']/misleading['tot_votes']
misleading
Calculating this statistic for each state in each year shows how the winning candidate's support is actually distributed. A simple scatterplot suggests that win proportions are roughly normally distributed within each election year, with a few persistent outliers. This hints that states are more evenly divided than commonly assumed.
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(misleading['year'], misleading['win_prop'])
To see this relationship better, we constructed a violin plot of the same data. The distributions look roughly normal in the earlier years but flatten out in recent ones, and the effect of the outliers is also visible.
# More centered in earlier years, but states differ by a lot in the future; mention outliers
sns.violinplot(x = misleading['year'], y = misleading['win_prop'])
To reduce some of this noise, we will standardize the win proportion within each election year. The summary below already shows that the standard deviations are larger in later years than in earlier ones.
# Standardize win_proportion
wp_standardized = misleading.groupby('year', as_index = False).agg({'year': 'first', 'win_prop': ['mean', 'std']})
wp_standardized.columns = wp_standardized.columns.map('|'.join).str.strip('|')
wp_standardized
# calculate standardized for EACH YEAR
values = []
for index, row in misleading.iterrows():
election_year = wp_standardized[wp_standardized['year|first'] == row['year']]
mean = float(election_year['win_prop|mean'])
std = float(election_year['win_prop|std'])
calc = (row['win_prop'] - mean)/std
values.append(calc)
misleading['stand_win_prop'] = values
misleading
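The per-year loop above is equivalent to a grouped z-score, which pandas can compute in one pass with `groupby(...).transform`; a sketch of the same calculation on toy numbers:

```python
import pandas as pd

df = pd.DataFrame({'year':     [1976, 1976, 1976, 1980, 1980, 1980],
                   'win_prop': [0.5, 0.6, 0.7, 0.4, 0.5, 0.6]})

# Standardize within each election year: subtract the year's mean win
# proportion and divide by its (sample) standard deviation.
df['stand_win_prop'] = (df.groupby('year')['win_prop']
                          .transform(lambda s: (s - s.mean()) / s.std()))
print(df)
# Within each year the three values come out as -1, 0, +1.
```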
Standardizing makes it easy to find the outlier states. Unsurprisingly, most of the outliers are the District of Columbia, which votes overwhelmingly Democratic in every election year. Utah is DC's mirror image, sticking with the Republican Party.
outliers = misleading[(misleading['stand_win_prop'] < -2) | (misleading['stand_win_prop'] > 2)]
outliers
Overall, the votes don't seem to sway one way or the other. According to the boxplot, the median sits at zero and the interquartile range is roughly symmetric.
# In general, there is very little deviation between candidates, and the outliers are relatively consistent
fig, ax = plt.subplots(figsize=(10,5))
ax = sns.boxplot(x=misleading['stand_win_prop'])
ax.axes.set_title("Standardized Win Proportion", fontsize=15)
The violin plot of the standardized data shows slightly more deviation between states in recent years than in earlier ones. It also shows that the bulk of each election year's distribution stays very consistent: no candidate really swings the majority of America.
fig, ax = plt.subplots(figsize=(10,10))
ax = sns.violinplot(x= misleading['year'], y= misleading['stand_win_prop'])
ax.axes.set_title("Win proportion over Time", fontsize=15)
ax.set_xlabel("Year", fontsize=15)
ax.set_ylabel("Standardized Win Proportion", fontsize=15)
The boxplot offers another view of the standardized data. It supports the violin plot: no candidate wins an outright 'landslide victory', and even the increased deviation in recent years is modest. The spread grows mostly below the 25th percentile, while the 75th percentile stays relatively consistent.
fig, ax = plt.subplots(figsize=(10,10))
ax = sns.boxplot(x='year', y='stand_win_prop', data=misleading)
ax.axes.set_title("Win proportion over Time", fontsize=15)
ax.set_xlabel("Year", fontsize=15)
ax.set_ylabel("Standardized Win Proportion", fontsize=15)
Based solely on the popular vote, elections do not appear to swing much. However, splitting wins up by state shows where the common misconception arises: even though the popular vote in each election was fairly evenly split, the statewide splits are very lopsided.
# party_class is 1 for Republican wins, so summing it counts Republican states
republican_year_wins = wins_by_state.groupby('year', as_index=False).agg({'party_class': 'sum'})
republican_year_wins['rep_avg'] = republican_year_wins['party_class'] / 51  # 50 states + DC
fig, ax = plt.subplots(figsize=(10,10))
ax.bar(x='year', height='rep_avg', data=republican_year_wins)
ax.set_xticks(ticks=np.arange(1976, 2017, 4))
ax.set_xlabel("Year", fontsize=15)
ax.set_ylabel("Fraction of States Won by Republicans", fontsize=15)
ax.set_title('Republican State Wins per Election Year', fontsize=15)
# party_class is 1 for Republican wins, so dividing the sum by the number of
# elections gives the fraction of elections in which each state voted Republican
republican_state_wins = wins_by_state.groupby('state', as_index=False).agg({'party_class': 'sum'})
republican_state_wins['rep_avg'] = republican_state_wins['party_class'] / len(ELECTION_YEARS)
On average, states voted Republican more often than Democrat over this period, which is consistent with Republicans winning the majority of the 1976-2016 presidential elections. However, as we saw before, this is a poor representation of the voting population.
fig, ax = plt.subplots(figsize=(10, 5))
ax = sns.boxplot(x = republican_state_wins['rep_avg'])
ax.set_title('Fraction of Elections in Which Each State Voted Republican')
ax.set_xlabel('Fraction')
To analyze the Electoral College's effect on elections, we plot each winning candidate's vote proportion in each state against that state's number of electoral votes. Most states with many electoral votes cluster near the 50% mark, so much of the skew arises because all of the losing candidate's votes effectively disappear under the 'winner takes all' system.
# merge electoral votes to previous dataframe
def win(row):
return row['elec_votes'] if row['year'] in is_party_winner[row['party']] else 0
popular_electoral = wins_by_state.merge(updated_elec_votes, on=['state', 'year'])
popular_electoral['won_elec_votes'] = popular_electoral.apply(win, axis=1)
popular_electoral['win_prop'] = popular_electoral['cand_votes'] / popular_electoral['tot_votes']
popular_electoral['exp_elec_votes'] = popular_electoral['win_prop'] * popular_electoral['elec_votes']
popular_electoral.head(3)
i = 0
j = 0
fig, ax = plt.subplots(nrows=4, ncols=3)
fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False)
fig.set_figheight(20)
fig.set_figwidth(20)
fig.delaxes(ax[3][2])
plt.xlabel('Win Proportion', fontsize=20)
plt.ylabel('Electoral Votes', fontsize=20)
for year in ELECTION_YEARS:
election_year = popular_electoral[popular_electoral['year'] == year].reset_index()
ax[i][j].scatter(election_year['win_prop'], y = election_year['elec_votes'])
ax[i][j].set_title(f'{year}')
if i == 3:
j = j + 1
i = 0
else:
i = i + 1
# aggregate by year
agg_scheme = {
'won_elec_votes': 'sum',
'cand_votes': 'sum',
'tot_votes': 'sum',
'exp_elec_votes': 'sum'
}
pop_elec_year = popular_electoral.groupby('year', as_index=False).agg(agg_scheme)
pop_elec_year['win_prop'] = pop_elec_year['cand_votes'] / pop_elec_year['tot_votes']
pop_elec_year['elec_prop'] = pop_elec_year['won_elec_votes'] / TOTAL_ELEC_VOTES
pop_elec_year = pop_elec_year.drop(['tot_votes', 'cand_votes'], axis=1)
pop_elec_year
Aggregating further by year, the relationship between a candidate's popular-vote proportion and electoral-vote proportion looks superlinear, closer to an exponential curve than a straight line.
fig, ax = plt.subplots(figsize=(7,5))
ax.scatter(data=pop_elec_year, x = 'win_prop', y = 'elec_prop')
ax.set_xlabel('Popular Vote Proportion', fontsize=12)
ax.set_ylabel('Electoral Vote Proportion', fontsize=12)
ax.set_title('Popular Vote on Electoral Vote from 1976-2016', fontsize=15)
However, converting each candidate's electoral votes according to their popular vote in each state (expected electoral votes = win proportion * electoral votes), and aggregating the results countrywide, gives a more nuanced picture.
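As a concrete check of this conversion (illustrative numbers, not taken from the data): a candidate winning 52% of the vote in a 55-vote state keeps all 55 votes under winner-take-all, but only about 29 under the scaled scheme.

```python
def representative_votes(win_prop, elec_votes):
    """Scale a state's electoral votes by the winning candidate's vote share."""
    return win_prop * elec_votes

# 52% of the vote in a 55-vote state (roughly a close race in California):
print(representative_votes(0.52, 55))  # 28.6, instead of all 55
```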
def label(rects):
for rect in rects:
height = rect.get_height()
ax.annotate('{}'.format(height),
xy=(rect.get_x() + .4, height),
xytext=(0, 3),
textcoords='offset points',
ha='center', va='bottom')
def tick_size(ticks):
for tick in ticks:
tick.label.set_fontsize(14)
# int() alone would truncate; round first to get the nearest whole vote count
pop_elec_year['round_exp'] = pop_elec_year['exp_elec_votes'].apply(lambda x: int(round(x)))
pop_elec_year['round_act'] = pop_elec_year['won_elec_votes'].apply(lambda x: int(round(x)))
pop_elec_year['round_prop'] = pop_elec_year['win_prop'].apply(lambda x: int(round(x * 538)))
fig, ax = plt.subplots(figsize=(18,10))
rep = ax.bar(x=pop_elec_year['year'] - 1, height=pop_elec_year['round_exp'], label='Representative')
act = ax.bar(x=pop_elec_year['year'], height=pop_elec_year['round_act'], label='Actual')
prop = ax.bar(x=pop_elec_year['year'] + 1, height=pop_elec_year['round_prop'], label='Popular Vote (scaled)')
ax.set_xticks(ticks=np.arange(1976, 2017, 4))
label(rep)
label(act)
label(prop)
tick_size(ax.xaxis.get_major_ticks())
tick_size(ax.yaxis.get_major_ticks())
ax.axhline(y=270, ls='--', c='black')
ax.set_xlabel('Year', fontsize=17)
ax.set_ylabel('Electoral Votes', fontsize=14)
ax.set_title('Electoral Votes vs Representative Electoral Votes', fontsize=15)
ax.legend(loc='upper right', fontsize=17)
socioeconomic['region'] = socioeconomic['state'].apply(lambda x: state_region[us_state_abbrev[x]])
soc_reg = socioeconomic.merge(wins_by_state[['year', 'state', 'state_code', 'party_class']], on=['year', 'state'])
soc_reg = soc_reg.merge(updated_elec_votes, on=['year', 'state'])
soc_reg['total_pop'] = soc_reg['BlackMale'] + soc_reg['BlackFemale'] + soc_reg['WhiteMale'] + soc_reg['WhiteFemale'] + soc_reg['OtherMale'] + soc_reg['OtherFemale']
soc_reg
First, looking at the distribution of race against votes, one can see that in recent years states with large White and Black populations tended to be won by Republicans. Keep in mind that, since we could not find voter turnout data, this reflects the demographics of the states that voted for each party, not of the voters themselves. Since minorities tend to vote Democrat, this may be weak evidence of some form of voter suppression.
def split(group):
return re.sub(r"(\w)([A-Z])", r"\1 \2", group)
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(20,15))
cols =[['WhiteFemale', 'WhiteMale'], ['BlackFemale', 'BlackMale'], ['OtherFemale', 'OtherMale']]
i = 0
j = 0
while j < 2:
plot = sns.violinplot(ax=ax[i][j], x='year', y=cols[i][j], hue='party', data=socioeconomic, split=True)
plot.set_title(f'Distribution of Votes for {split(cols[i][j])}s')
plot.set_xlabel('Year')
if i == 2:
i = 0
j = j + 1
else:
i = i + 1
race_df = soc_reg  # note: this aliases soc_reg, so columns added below also appear on soc_reg
race_df['black'] = race_df['BlackFemale'] + race_df['BlackMale']
race_df['white'] = race_df['WhiteFemale'] + race_df['WhiteMale']
race_df['other'] = race_df['OtherFemale'] + race_df['OtherMale']
race_df['black_std'] = race_df['black'] / race_df['total_pop']
race_df['white_std'] = race_df['white'] / race_df['total_pop']
race_df['other_std'] = race_df['other'] / race_df['total_pop']
race_df = race_df.drop(columns=['total_families', 'poor_families', 'BlackFemale', 'BlackMale', 'WhiteFemale', 'WhiteMale', 'OtherFemale', 'OtherMale'])
race_df['state_code'] = race_df['state'].map(us_state_abbrev)
race_df.nlargest(10, ['black'])
def plot_race_map(arg, r):
data_slider = []
for year in wins_by_state.year.unique():
df = race_df[(race_df['year'] == year)]
df = df.nlargest(10, [r])
df['text'] = df[r]
by_year = dict(
type='choropleth',
locations=df['state_code'],
z=df[arg].astype(float),
locationmode='USA-states',
colorscale=scl,
text=df['text']
)
data_slider.append(by_year)
steps = []
count = 1976
for i in range(len(data_slider)):
step = dict(method='restyle',
args=['visible', [False] * len(data_slider)],
label='Year {}'.format(count) # label to be displayed for each step (year)
)
step['args'][1][i] = True
steps.append(step)
count += 4
sliders = [dict(active=10, pad={"t": 1}, steps=steps)]
layout = dict(
geo=dict(scope='usa', projection={'type': 'albers usa'}),
sliders=sliders
)
fig = dict(data=data_slider, layout=layout)
plotly.offline.iplot(fig, show_link=False)
Here we can see that the states with the largest proportion of Black residents are in the South and along the East Coast. Since, as previously mentioned, the majority of Black voters are Democrats, it is odd to see many of these states voting Republican. This could reflect low voter turnout or even voter suppression. https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/
plot_race_map('party_class', 'black_std')
The states with the highest proportion of White residents are in the Midwest and Northeast. Since White voters are roughly split between Democrats and Republicans, this makes sense and shows where White Democrats and White Republicans reside. https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/
plot_race_map('party_class', 'white_std')
This shows that the 'Other' category, which includes Hispanic, Pacific Islander, Asian, and Native American residents, is concentrated on the West Coast and in New York/New Jersey (most likely around NYC). Hispanic and Asian populations usually vote overwhelmingly Democratic. https://www.pewresearch.org/politics/2018/03/20/1-trends-in-party-affiliation-among-demographic-groups/
plot_race_map('party_class', 'other_std')
Previously, the three-bar graph showed that the expected (representative) electoral votes were not exactly the same as a candidate's win percentage scaled to 538. The slight differences in how electoral votes are actually apportioned are perplexing, so we plotted state populations against their electoral vote counts to look for trends among particular states. We split these plots into 8-year intervals to see how the mean population of each state relates to its electoral votes.
soc_reg['total_pop'] = soc_reg['black'] + soc_reg['white'] + soc_reg['other']
len(soc_reg.columns)
total_pop = wins_by_state.merge(updated_elec_votes, on=['year', 'state'])
total_pop['total_pop'] = soc_reg['total_pop']  # relies on matching row order/index with soc_reg
total_pop['period_intervals'] = pd.cut(total_pop['year'], 5)  # five equal-width 8-year bins
eight_year_average = pd.DataFrame({
'average_pop': total_pop.groupby(['period_intervals', 'state'])['total_pop'].mean(),
'average_evotes': total_pop.groupby(['period_intervals', 'state'])['elec_votes'].mean()
}).dropna().reset_index()
eight_year_average
lst = sorted(list(set(eight_year_average['period_intervals'])))
lst
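As a minimal sketch of what `pd.cut` is doing above (on toy data, not the project's dataframe): the presidential election years 1976 through 2016 are bucketed into five equal-width intervals, each 8 years wide.

```python
import pandas as pd

# Toy example of how pd.cut buckets election years into five equal-width
# intervals: (2016 - 1976) / 5 = 8 years per bin.
years = pd.Series([1976, 1980, 1984, 1988, 1992, 1996,
                   2000, 2004, 2008, 2012, 2016])
intervals = pd.cut(years, 5)

# The leftmost edge is extended slightly so 1976 is included in the first bin.
print(intervals.value_counts(sort=False))
```

Note that the first bin is half-open on the left and therefore captures 1976, 1980, and 1984, while each subsequent bin captures two election years.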
def plot_intervals(interval_num, year_range):
    # Scatter each state's mean population against its mean electoral votes
    # for one 8-year interval, then overlay a first-degree polynomial fit.
    fig, ax = plt.subplots(figsize=(20, 15))
    interval = eight_year_average[eight_year_average['period_intervals'] == lst[interval_num]]
    for i, row in interval.iterrows():
        ax.plot(row['average_pop'], row['average_evotes'], 'o')
        plt.annotate(row['state'], (row['average_pop'], row['average_evotes']))
    model = np.polyfit(interval['average_pop'], interval['average_evotes'], 1)
    predict = np.poly1d(model)
    x_lin_reg = list(interval['average_pop'])
    y_lin_reg = predict(x_lin_reg)
    plt.xlabel('Mean Population')
    plt.ylabel('Electoral Votes')
    plt.title(f'Electoral Votes: {year_range}')
    plt.plot(x_lin_reg, y_lin_reg, c='black')
    plt.show()
plot_intervals(0, '1976 - 1984')
plot_intervals(1, '1984 - 1992')
plot_intervals(2, '1992 - 2000')
plot_intervals(3, '2000 - 2008')
plot_intervals(4, '2008 - 2016')
Over the five plots, it is quite evident that a state's population affects its number of electoral votes. While California is always at the top with the most electoral votes, we see the electoral votes of some states, such as Texas, increasing over time due to population growth. New York's electoral votes, however, seem to decrease slightly over time even though its population remained roughly the same. To further confirm our assumption that electoral votes increase with a state's population, we will now perform a hypothesis test.
To examine the relationship between the number of electoral votes per state and the state's total population, we shall conduct a hypothesis test.
from statsmodels.formula.api import ols
average = pd.DataFrame({
'average_pop': total_pop.groupby(['state'])['total_pop'].mean(),
'average_evotes': total_pop.groupby(['state'])['elec_votes'].mean()
}).dropna().reset_index()
reg = ols(formula='average_evotes ~ average_pop', data=average).fit()
print(reg.summary())
Using the statsmodels library, we obtain the regression results for the dataset with mean population per state (x-value) and number of electoral votes per state (y-value). We take $\beta_1$ to be the coefficient on the x-value. We are testing at a 5% significance level, therefore $\alpha = 0.05$.
$H_0$: $\beta_1 = 0$ (null)
$H_a$: $\beta_1 \neq 0$ (alternate)
If the p-value of $\beta_1$ (average_pop) is greater than the significance level, we fail to reject the null hypothesis.
If the p-value of $\beta_1$ (average_pop) is less than the significance level, we reject the null hypothesis.
p-value = 0.000
We clearly see that p-value $< \alpha$.
Since we found the p-value to be less than the significance level, we reject the null hypothesis. This means we can conclude that there is a statistically significant linear relationship between the population of a state and the electoral votes for that state.
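As a sketch of the same decision rule, the slope's p-value can also be read off the fitted model programmatically rather than from the printed summary. The data here is synthetic (not the election dataset), with an intentionally strong linear relationship between a hypothetical population column and a vote column:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data with a genuine linear relationship plus noise, just to
# demonstrate extracting the slope's p-value from a fitted OLS model.
rng = np.random.default_rng(0)
pop = rng.uniform(1e6, 4e7, size=50)             # hypothetical state populations
evotes = 3 + pop / 7e5 + rng.normal(0, 2, 50)    # votes roughly linear in population
df = pd.DataFrame({'average_pop': pop, 'average_evotes': evotes})

fit = ols('average_evotes ~ average_pop', data=df).fit()
p_value = fit.pvalues['average_pop']  # p-value for the beta_1 coefficient

# Reject H0 (beta_1 = 0) at the 5% level when the p-value falls below alpha.
print(f'slope p-value: {p_value:.3g}, reject H0: {p_value < 0.05}')
```

`RegressionResults.pvalues` is a Series indexed by parameter name, so the same lookup works on the election regression above via `reg.pvalues['average_pop']`.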
In this section, we will try to predict the results of the 2020 election and compare them with the actual outcome. As we all know, the Democratic Party won this year's election. Since the 2020 census data has not yet been made available, we make our prediction on the 2019 census data, the closest available representation of the current population. The data has been extracted from
We initially clean the dataset to count the total population per state for each race. We then train the model on the popular candidate dataset, whose party class column represents the outcome of each state (Democrat or Republican); hence, this column is used as the y-value in the model. The model is trained using state, race, and region as the features, all of which are also available in the 2019 census data. After training the model on the existing data with the outcomes for each state, we predict the outcomes using the 2019 census data.
0 is classified as Republican while 1 is classified as Democrat. The features included are: race populations (black, white, other), state (dummy), region (dummy), and total population.
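As a minimal sketch of the dummy encoding used for the categorical features (on toy data, not the project's dataframes): `pd.get_dummies` expands `state` and `region` into indicator columns, and `drop_first=True` drops one level of each to avoid perfectly collinear dummies.

```python
import pandas as pd

# Toy dataframe standing in for the census data; the column names mirror
# the ones used in the model, but the values are illustrative.
toy = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'region': ['South', 'West', 'West'],
    'total_pop': [4.9e6, 0.7e6, 7.3e6],
})

# One indicator column per remaining category; 'Alabama' and 'South'
# (the first levels alphabetically) are dropped as baselines.
encoded = pd.get_dummies(toy, columns=['state', 'region'], drop_first=True)
print(encoded.columns.tolist())
```

Because the baseline level is implied by all dummies being zero, the training matrix `X` and the 2019 matrix `X_test` must end up with identical columns in identical order for `predict` to be meaningful.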
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
pop_2019 = pd.read_csv('raw_data_2019.csv')
pop_2019 = pop_2019.head(50)
pop_2019 = pop_2019.replace(np.nan, 0)
pop_2019['other'] = pop_2019['Hispanic'] + pop_2019['Asian'] + pop_2019['American Indian/Alaska Native'] + \
pop_2019['Native Hawaiian/Other Pacific Islander'] + pop_2019['Multiple Races']
pop_2019 = pop_2019.drop(columns=[
'Hispanic',
'Asian',
'American Indian/Alaska Native',
'Native Hawaiian/Other Pacific Islander',
'Multiple Races',
'Total'
])
pop_2019 = pop_2019.rename(columns={
'White': 'white',
'Black': 'black',
'Location': 'state'
})
pop_2019['region'] = pop_2019['state'].apply(lambda x: state_region[us_state_abbrev[x]])
pop_2019['total_pop'] = pop_2019['white'] + pop_2019['black'] + pop_2019['other']
data = soc_reg[soc_reg['state'] != 'District of Columbia']
data = pd.get_dummies(data, columns=['state', 'region'], drop_first=True)
data['black'] = data['BlackFemale'] + data['BlackMale']
data['white'] = data['WhiteFemale'] + data['WhiteMale']
data['other'] = data['OtherFemale'] + data['OtherMale']
data = data.drop(columns=['black_std', 'white_std', 'other_std', 'year', 'party', 'total_families', 'poor_families', 'BlackFemale', 'BlackMale', 'WhiteFemale', 'WhiteMale', 'OtherFemale', 'OtherMale'])
X = data.drop(columns=['party_class', 'percent', 'state_code', 'elec_votes'])
y = np.array(data["party_class"])
X_test = pd.get_dummies(pop_2019, columns=['state', 'region'], drop_first=True)  # must match X's columns and order
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
tree_y_predicted = clf.predict(X_test)
electoral_votes_2016 = wins_by_state[wins_by_state['year'] == 2016]
electoral_votes_2016 = electoral_votes_2016.reset_index().drop(columns=['index'])
electoral_votes_2016 = electoral_votes_2016.merge(updated_elec_votes, on=['year', 'state'])
After getting the predicted outcomes for each state, we are trying to count the number of electoral votes for republicans vs democrats.
republican_votes = 0
democrat_votes = 0
for i in range(len(tree_y_predicted)):
    # Class 0 -> Republican, class 1 -> Democrat (as defined above).
    if tree_y_predicted[i] == 0:
        republican_votes += electoral_votes_2016.loc[i, 'elec_votes']
    else:
        democrat_votes += electoral_votes_2016.loc[i, 'elec_votes']
print(f'Republicans: \t{republican_votes} electoral votes,\nDemocrats: \t{democrat_votes} electoral votes')
We find that the Democrats beat the Republicans (291 vs. 244 electoral votes), which matches the 2020 election results. The prediction was very close to the actual outcome: Joe Biden earned 306 electoral votes whereas Donald Trump earned 232. This classification essentially shows how the population and racial makeup of each state help us determine, or predict, election results.
Through this entire project, we gained a good understanding of how the Electoral College works: the roles of population, race, and electoral votes in the election. By exploring the data, we saw that a candidate can still win the election without a majority of the popular vote. The project makes the effects of the "winner takes all" system evident, and through the regression analysis as well as the hypothesis test we identified that the number of electoral votes per state is related to the state's population. We also explored the states that have the most impact on the election, as well as the outliers. This gave us more insight and led us to our next step of analyzing election data grouped by race per state. All of this analysis brought us back to the criticisms of the Electoral College raised in the introduction.